I always found it difficult to find perfectly organized, clean data online. Because of this, I decided to learn how to scrape webpages for the specific things that I needed. Whatever table, paragraph, or hidden image I wanted, I could grab and import into RStudio. From there, I can clean and rearrange the data to fit my needs, producing a polished image or data frame.
The first step in web scraping is reading in a webpage. This is done simply with read_html("webpage"). Once a webpage is read in, you can begin gathering data. The easiest way to import data is to search for tables within the page with html_table(), which returns a list of every table on the webpage that you can then search through. By selecting one table, for example table <- tables[[1]], you can import an entire data set into RStudio and begin your data manipulation.
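As a minimal sketch of this step, the example below uses rvest with a small inline page (an assumption made so the code runs without a network connection); with a live site you would pass the URL to read_html() instead.

```r
library(rvest)

# A small inline page stands in for a real website so the example runs offline.
# With a live site you would call read_html("https://...") instead.
page <- minimal_html('
  <table>
    <tr><th>Country</th><th>Population (2020)</th></tr>
    <tr><td>Afghanistan</td><td>38,928,346</td></tr>
    <tr><td>Albania</td><td>2,877,797</td></tr>
  </table>
')

# html_table() returns a list with one data frame per <table> on the page.
tables <- html_table(page)

# Pick the table you want by its position in the list.
table <- tables[[1]]
print(table)
```

Note that numbers written with thousands separators usually arrive as text, which is one reason a cleaning step is needed afterwards.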
If you are searching through paragraphs or images for specific items, you need to target an HTML element. With a browser plug-in called SelectorGadget, you can highlight any part of a webpage and grab its CSS selector or XPath. When scraping an HTML element, you read it in through this "path", which tells R where in the HTML document to look for the data that you want.
If you are curious where these paths are located, you can inspect an element in your browser. This brings up the page's underlying HTML code.
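A short sketch of reading elements by their path, assuming rvest and a made-up inline page (the class names "fact" and "flag" are hypothetical; on a real page they would come from SelectorGadget or the inspector):

```r
library(rvest)

# Hypothetical page fragment used only for illustration.
page <- minimal_html('
  <div class="intro">
    <p class="fact">Population figures are for 2020.</p>
    <p>Other text.</p>
    <img class="flag" src="flags/example.png" alt="Example flag">
  </div>
')

# Select by CSS selector, the kind of path SelectorGadget produces.
facts <- page %>% html_elements("p.fact") %>% html_text2()

# Select the same paragraph with an XPath expression instead.
facts_xpath <- page %>%
  html_elements(xpath = "//p[@class='fact']") %>%
  html_text2()

# Attributes, such as the source of an image, are read with html_attr().
img_src <- page %>% html_elements("img.flag") %>% html_attr("src")
```

Both selector styles return the same paragraph here; which one to use is mostly a matter of which path is easier to copy from the page.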
A typical website will have a built-in table, a paragraph, or some type of image explaining a certain topic. In that form, the information is difficult to manipulate or turn into usable data. My first website contained demographics for every country in the world. It looked a little like this:
Using the combined techniques listed above, I was able to import a list of countries with their populations, land area, and density as a data.frame into RStudio.
The data that I collected from this webpage was in rough shape and required some cleaning and rearranging. Here is some of the code used:
The end product of my first web scrape is a scrollable table listing each country and its 2020 population. It can be saved as an HTML document and embedded in any website or presentation:
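The exact cleaning steps depend on the page, so the following is only a minimal sketch rather than the original code. It assumes a raw table with the hypothetical column names "Country" and "Population (2020)", and uses janitor to tidy the names before converting the population text to numbers.

```r
library(janitor)

# Hypothetical raw table, shaped like what html_table() might return:
# awkward column names, stray whitespace, and numbers stored as text.
raw <- data.frame(
  "Country" = c(" Afghanistan", "Albania "),
  "Population (2020)" = c("38,928,346", "2,877,797"),
  check.names = FALSE
)

# clean_names() converts the headers to snake_case.
clean <- clean_names(raw)

# Trim whitespace and strip thousands separators before converting to numeric.
clean$country <- trimws(clean$country)
clean$population_2020 <- as.numeric(gsub(",", "", clean$population_2020))

print(clean)
```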
| Country | Population (2020) |
|---|---|
| Afghanistan | 38928346 |
| Albania | 2877797 |
| Algeria | 43851044 |
| Andorra | 77265 |
| Angola | 32866272 |
| Antigua and Barbuda | 97929 |
| Argentina | 45195774 |
| Armenia | 2963243 |
| Australia | 25499884 |
| Austria | 9006398 |
| Azerbaijan | 10139177 |
| Bahamas | 393244 |
| Bahrain | 1701575 |
| Bangladesh | 164689383 |
| Barbados | 287375 |
| Belarus | 9449323 |
| Belgium | 11589623 |
| Belize | 397628 |
| Benin | 12123200 |
| Bhutan | 771608 |
| Bolivia | 11673021 |
| Bosnia and Herzegovina | 3280819 |
| Botswana | 2351627 |
| Brazil | 212559417 |
| Brunei | 437479 |
| Bulgaria | 6948445 |
| Burkina Faso | 20903273 |
| Burundi | 11890784 |
| Côte d'Ivoire | 26378274 |
| Cape Verde | 555987 |
| Cambodia | 16718965 |
| Cameroon | 26545863 |
| Canada | 37742154 |
| Central African Republic | 4829767 |
| Chad | 16425864 |
| Chile | 19116201 |
| China | 1439323776 |
| Colombia | 50882891 |
| Comoros | 869601 |
| Congo [Republic] | 5518087 |
| Costa Rica | 5094118 |
| Croatia | 4105267 |
| Cuba | 11326616 |
| Cyprus | 1207359 |
| Czech Republic | 10708981 |
| Congo [DRC] | 89561403 |
| Denmark | 5792202 |
| Djibouti | 988000 |
| Dominica | 71986 |
| Dominican Republic | 10847910 |
| Ecuador | 17643054 |
| Egypt | 102334404 |
| El Salvador | 6486205 |
| Equatorial Guinea | 1402985 |
| Eritrea | 3546421 |
| Estonia | 1326535 |
| Swaziland | 1160164 |
| Ethiopia | 114963588 |
| Fiji | 896445 |
| Finland | 5540720 |
| France | 65273511 |
| Gabon | 2225734 |
| Gambia | 2416668 |
| Georgia | 3989167 |
| Germany | 83783942 |
| Ghana | 31072940 |
| Greece | 10423054 |
| Grenada | 112523 |
| Guatemala | 17915568 |
| Guinea | 13132795 |
| Guinea-Bissau | 1968001 |
| Guyana | 786552 |
| Haiti | 11402528 |
| Vatican City | 801 |
| Honduras | 9904607 |
| Hungary | 9660351 |
| Iceland | 341243 |
| India | 1380004385 |
| Indonesia | 273523615 |
| Iran | 83992949 |
| Iraq | 40222493 |
| Ireland | 4937786 |
| Israel | 8655535 |
| Italy | 60461826 |
| Jamaica | 2961167 |
| Japan | 126476461 |
| Jordan | 10203134 |
| Kazakhstan | 18776707 |
| Kenya | 53771296 |
| Kiribati | 119449 |
| Kuwait | 4270571 |
| Kyrgyzstan | 6524195 |
| Laos | 7275560 |
| Latvia | 1886198 |
| Lebanon | 6825445 |
| Lesotho | 2142249 |
| Liberia | 5057681 |
| Libya | 6871292 |
| Liechtenstein | 38128 |
| Lithuania | 2722289 |
| Luxembourg | 625978 |
| Madagascar | 27691018 |
| Malawi | 19129952 |
| Malaysia | 32365999 |
| Maldives | 540544 |
| Mali | 20250833 |
| Malta | 441543 |
| Marshall Islands | 59190 |
| Mauritania | 4649658 |
| Mauritius | 1271768 |
| Mexico | 128932753 |
| Micronesia | 548914 |
| Moldova | 4033963 |
| Monaco | 39242 |
| Mongolia | 3278290 |
| Montenegro | 628066 |
| Morocco | 36910560 |
| Mozambique | 31255435 |
| Myanmar [Burma] | 54409800 |
| Namibia | 2540905 |
| Nauru | 10824 |
| Nepal | 29136808 |
| Netherlands | 17134872 |
| New Zealand | 4822233 |
| Nicaragua | 6624554 |
| Niger | 24206644 |
| Nigeria | 206139589 |
| North Korea | 25778816 |
| Macedonia [FYROM] | 2083374 |
| Norway | 5421241 |
| Oman | 5106626 |
| Pakistan | 220892340 |
| Palau | 18094 |
| Palestinian Territories | 5101414 |
| Panama | 4314767 |
| Papua New Guinea | 8947024 |
| Paraguay | 7132538 |
| Peru | 32971854 |
| Philippines | 109581078 |
| Poland | 37846611 |
| Portugal | 10196709 |
| Qatar | 2881053 |
| Romania | 19237691 |
| Russia | 145934462 |
| Rwanda | 12952218 |
| Saint Kitts and Nevis | 53199 |
| Saint Lucia | 183627 |
| Saint Vincent and the Grenadines | 110940 |
| Samoa | 198414 |
| San Marino | 33931 |
| São Tomé and Príncipe | 219159 |
| Saudi Arabia | 34813871 |
| Senegal | 16743927 |
| Serbia | 8737371 |
| Seychelles | 98347 |
| Sierra Leone | 7976983 |
| Singapore | 5850342 |
| Slovakia | 5459642 |
| Slovenia | 2078938 |
| Solomon Islands | 686884 |
| Somalia | 15893222 |
| South Africa | 59308690 |
| South Korea | 51269185 |
| South Sudan | 11193725 |
| Spain | 46754778 |
| Sri Lanka | 21413249 |
| Sudan | 43849260 |
| Suriname | 586632 |
| Sweden | 10099265 |
| Switzerland | 8654622 |
| Syria | 17500658 |
| Tajikistan | 9537645 |
| Tanzania | 59734218 |
| Thailand | 69799978 |
| Timor-Leste | 1318445 |
| Togo | 8278724 |
| Tonga | 105695 |
| Trinidad and Tobago | 1399488 |
| Tunisia | 11818619 |
| Turkey | 84339067 |
| Turkmenistan | 6031200 |
| Tuvalu | 11792 |
| Uganda | 45741007 |
| Ukraine | 43733762 |
| United Arab Emirates | 9890402 |
| United Kingdom | 67886011 |
| United States | 331002651 |
| Uruguay | 3473730 |
| Uzbekistan | 33469203 |
| Vanuatu | 307145 |
| Venezuela | 28435940 |
| Vietnam | 97338579 |
| Yemen | 29825964 |
| Zambia | 18383955 |
| Zimbabwe | 14862924 |
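
A scroll table like the one above could be built roughly as follows. This is a sketch under assumptions, using kableExtra with only the first three rows of the table; the file name and the 300px height are arbitrary choices for illustration.

```r
library(kableExtra)

# First three rows of the table above, used as sample data.
pop <- data.frame(
  country = c("Afghanistan", "Albania", "Algeria"),
  population_2020 = c(38928346, 2877797, 43851044)
)

# Build an HTML table and wrap it in a fixed-height scroll box.
scroll_tbl <- kbl(pop, format = "html",
                  col.names = c("Country", "Population (2020)")) %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  scroll_box(height = "300px")

# Write the table's HTML to a file (a temporary path is used here).
out_file <- file.path(tempdir(), "population_table.html")
writeLines(as.character(scroll_tbl), out_file)
```

Writing the HTML by hand keeps the sketch dependency-free; kableExtra also provides save_kable() for saving a table to a file.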
I wanted a more interactive display of the previous list of countries, so I chose to create a map. To complete this task, I needed the latitude and longitude of every country. I scraped the table below and imported this information into an R data frame.
Through R code for cleaning, combining, and plotting, I was able to produce an interactive map of the world. The map lets you view every country and, when a country is clicked, displays its 2020 population.
htmltools::includeHTML("Second_Webpage/map.html")
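One way such a map could be put together is sketched below. The leaflet package is an assumption (the mapping library actually used is not named here), the coordinates are approximate values for illustration only, and the output path is a temporary file rather than Second_Webpage/map.html.

```r
library(leaflet)
library(htmlwidgets)

# Populations from the table above.
pop <- data.frame(
  country = c("Afghanistan", "Albania"),
  population_2020 = c(38928346, 2877797)
)

# Approximate country coordinates, for illustration only.
coords <- data.frame(
  country = c("Afghanistan", "Albania"),
  latitude = c(33.9, 41.2),
  longitude = c(67.7, 20.2)
)

# Combine the two scraped tables on the shared country name.
map_data <- merge(pop, coords, by = "country")

# One marker per country; clicking a marker opens a popup with the population.
m <- leaflet(map_data) %>%
  addTiles() %>%
  addMarkers(
    lng = ~longitude, lat = ~latitude,
    popup = ~paste0(country, ": ",
                    format(population_2020, big.mark = ",", trim = TRUE))
  )

# Save the widget as an HTML file that can be embedded elsewhere.
map_file <- file.path(tempdir(), "map.html")
saveWidget(m, map_file, selfcontained = FALSE)
```

Joining on the country name only works when both tables spell the names the same way, so mismatched names need to be reconciled before the merge.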
For my final push in learning web scraping, I wanted to be a little more creative. By scraping a list of the most visited countries and their tourist attractions, I was able to create an interactive plot.
The website with this information was a little trickier. Using the browser's HTML inspector, I was able to pinpoint the most visited countries within the website's map and scrape the data into RStudio.
This was then combined through R code with a second CSV file of top tourist attractions to create the graph below. The graph shows the number of tourist arrivals and, when a bar is hovered over, the top tourist attraction in that country.
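A rough sketch of this kind of plot, assuming plotly and entirely made-up placeholder data (the country names, arrival figures, and attractions below are hypothetical, and in practice the second data frame would come from read.csv() on the attractions file):

```r
library(plotly)

# Placeholder arrivals data standing in for the scraped table.
arrivals <- data.frame(
  country = c("Country A", "Country B", "Country C"),
  arrivals_millions = c(30, 20, 10)
)

# Placeholder attractions data standing in for the CSV file.
attractions <- data.frame(
  country = c("Country A", "Country B", "Country C"),
  attraction = c("Attraction A", "Attraction B", "Attraction C")
)

# Combine the scraped data with the CSV data on the country name.
tourism <- merge(arrivals, attractions, by = "country")

# Bar chart of arrivals; hovering over a bar shows the top attraction.
p <- plot_ly(
  tourism,
  x = ~country, y = ~arrivals_millions, type = "bar",
  hovertext = ~paste("Top attraction:", attraction),
  hoverinfo = "text"
) %>%
  layout(
    xaxis = list(title = "Country"),
    yaxis = list(title = "Tourist arrivals (millions)")
  )
```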
Not only was I able to master the art of web scraping, but I also learned some valuable packages such as xml2, rvest, janitor, kableExtra, htmlwidgets, and plotly.